GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs

Authors

Kaixi Hou

Hao Wang

Wu-chun Feng

Abstract

Spatial blocking is a critical memory-access optimization for efficiently exploiting the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly reduce the pressure of accessing slow global memory. Stencil computations, for example, can exploit such data reuse via spatial blocking through the memory hierarchy of the GPU to improve performance. However, taking advantage of such blocking requires complex and tedious changes to the GPU kernels for different stencils, GPU architectures, and multi-level cache systems. In this work, we explore the challenges of different spatial blocking strategies over three cache levels of the GPU (i.e., L1 cache, scratchpad memory, and registers) and propose a framework, GPU-UniCache, that automatically generates code to access buffered data in the cache systems of GPUs. Based on the characteristics of spatial blocking over various stencil kernels, we generalize the patterns of data communication, index conversion, and synchronization (with abstracted ISA-friendly interfaces) and map them to different architectures with highly optimized code variants. Our approach greatly simplifies the design of efficient and portable stencil computations across GPUs. Compared to stencil kernels based on hardware-managed memory (L1 cache) and other state-of-the-art GPU benchmarks, GPU-UniCache can achieve significant improvements.

CCS Concepts: • Computing methodologies → Vector / streaming algorithms; • Software and its engineering → Domain specific languages
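To make the data-reuse idea in the abstract concrete, the following is a minimal CPU-side sketch of spatial blocking for a 1D 3-point stencil. It is not the GPU-UniCache framework or its generated code. The names `stencil_direct`, `stencil_blocked`, `TILE`, and `RADIUS` are hypothetical, and a plain Python list stands in for a scratchpad buffer. The read counters only illustrate how staging a tile plus its halo once can cut accesses to the slow "global" array.

```python
# Sketch only: emulates GPU-style spatial blocking on the CPU with plain lists.
# All names here (stencil_direct, stencil_blocked, TILE, RADIUS) are hypothetical.

RADIUS = 1   # 3-point stencil: out[i] = in[i-1] + in[i] + in[i+1]
TILE = 4     # number of outputs produced per block


def stencil_direct(data):
    """Reference version: every output reads its neighbours from 'global' memory."""
    n = len(data)
    out = list(data)          # boundary points are copied unchanged
    reads = 0
    for i in range(RADIUS, n - RADIUS):
        out[i] = sum(data[i - RADIUS:i + RADIUS + 1])
        reads += 2 * RADIUS + 1
    return out, reads


def stencil_blocked(data):
    """Blocked version: stage a tile plus halo once, then reuse it for all outputs."""
    n = len(data)
    out = list(data)
    reads = 0
    for start in range(RADIUS, n - RADIUS, TILE):
        stop = min(start + TILE, n - RADIUS)
        # One pass over 'global' memory fills the local buffer (tile + halo).
        tile = data[start - RADIUS:stop + RADIUS]
        reads += len(tile)
        for i in range(start, stop):
            j = i - start + RADIUS          # index conversion: global -> tile-local
            out[i] = sum(tile[j - RADIUS:j + RADIUS + 1])
    return out, reads


grid = list(range(10))
direct_out, direct_reads = stencil_direct(grid)
blocked_out, blocked_reads = stencil_blocked(grid)
```

Both functions produce the same output. On this ten-element grid the direct version counts 24 global reads and the blocked version counts 12, because each staged element is reused by up to three outputs. On a real GPU the staging step would also need a synchronization barrier before the tile is read, which this sequential sketch omits.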

Related resources

Accelerating high-order WENO schemes using two heterogeneous GPUs

A double-GPU code is developed to accelerate WENO schemes. The test problem is a compressible viscous flow. The convective terms are discretized using third- to ninth-order WENO schemes and the viscous terms are discretized by the standard fourth-order central scheme. The code written in CUDA programming language is developed by modifying a single-GPU code. The OpenMP library is used for parall...

The Promises of Hybrid Hexagonal/Classical Tiling for GPU

Time-tiling is necessary for efficient execution of iterative stencil computations. But the usual hyper-rectangular tiles cannot be used because of positive/negative dependence distances along the stencil's spatial dimensions. Several prior efforts have addressed this issue. However, known techniques trade enhanced data reuse for other causes of inefficiency, such as unbalanced parallelism, redundan...

Automatic Code Generation and Adaptive Grid Scheduling for GPU Cluster Computing

Recent advances in GPUs (graphics processing units) have led to massively parallel hardware that is easily programmable and widely applied in areas that require intensive computation besides graphics acceleration. GPU clusters are gaining popularity in the scientific computing community, and the study of GPU clusters is becoming an increasingly hot issue. While extending a single-GPU system...

An approach to Improve Particle Swarm Optimization Algorithm Using CUDA

The time consumption in solving computationally heavy problems has always been a concern for computer programmers. Due to simplicity of its implementation, the PSO (Particle Swarm Optimization) is a suitable meta-heuristic algorithm for solving computationally heavy problems. However, despite the simplicity, the algorithm is inefficient for solving real computationally heavy problems but the pr...

Improvement of generative adversarial networks for automatic text-to-image generation

This research is related to the use of deep learning tools and image processing technology in the automatic generation of images from text. Previous researches have used one sentence to produce images. In this research, a memory-based hierarchical model is presented that uses three different descriptions that are presented in the form of sentences to produce and improve the image. The proposed ...

Publication date: 2017
